Qwen3-VL: torch-free numpy processor #61
Open
Blaizzy wants to merge 10 commits into
HF's Qwen3-VL image/video processors hard-require torch/torchvision. Inline the numpy port adapted from mlx-vlm (commit 1bf7742, unreleased) so mlx-embeddings can run without torch installed, including real checkpoints like mlx-community/Qwen3-VL-2B-Instruct-4bit.
Drops the AutoImageProcessor.from_pretrained(use_fast=False) path, the _UnsupportedVideoProcessor stub, and the object.__new__(Qwen3VLProcessor) trick. Processor.from_pretrained now delegates to the local torch-free Qwen3VLProcessor.from_pretrained, which reads processor_config.json / preprocessor_config.json / video_preprocessor_config.json directly and builds numpy Qwen3VLImageProcessor / Qwen3VLVideoProcessor.
Small fixes on top of the mlx-vlm source:
- Flatten list-of-list image/video batches (HF's apply_chat_template nests them that way).
- Treat explicit None in preprocessor_config.json (min_pixels/max_pixels) the same as missing: the 2B Instruct checkpoint ships nulls alongside valid size.shortest_edge/longest_edge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
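The two small fixes in this commit can be illustrated with a minimal sketch. The helper names `flatten_batch` and `resolve_pixel_bound` and the default value passed in are hypothetical, not the names used in the port, and the pairing of `min_pixels` with `size.shortest_edge` is an assumption based on the config keys mentioned above.

```python
from typing import Any


def flatten_batch(items: Any) -> list:
    """Flatten one level of nesting: [[a, b], [c]] -> [a, b, c].

    Hypothetical helper; a bare item becomes a one-element list.
    """
    if items is None:
        return []
    if not isinstance(items, (list, tuple)):
        return [items]
    flat = []
    for item in items:
        if isinstance(item, (list, tuple)):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def resolve_pixel_bound(config: dict, key: str, size_key: str, default: int) -> int:
    """Pick a pixel bound, treating an explicit null like a missing key.

    Assumed lookup order: config[key], then config["size"][size_key],
    then the caller-supplied default.
    """
    value = config.get(key)
    if value is None:
        value = (config.get("size") or {}).get(size_key)
    return default if value is None else value
```

With a config such as `{"min_pixels": None, "size": {"shortest_edge": 65536}}`, the null no longer clobbers the valid size entry.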
README examples surfaced two gaps in the port:
- Image inputs passed as https:// URLs (the embedding/reranker README examples) hit `FileNotFoundError` because `_to_numpy_image` treated every string as a local path. Detect URLs and fetch via requests.
- `Qwen/Qwen3-VL-Reranker-2B` ships its chat template in chat_template.jinja, not in tokenizer_config.json. Add a `_load_qwen_vl_text` helper (local-then-Hub) and fall back to it when neither processor_config.json nor the tokenizer carries a template.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
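A minimal sketch of the URL-versus-path distinction described in the first item. `is_url` and `read_image_bytes` are hypothetical names for illustration, not the helpers in the port, and the sketch returns raw bytes rather than a numpy image.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """True only for http(s) URLs; local paths (including Windows drive paths) are not."""
    return urlparse(source).scheme in ("http", "https")


def read_image_bytes(source: str, timeout: float = 10.0) -> bytes:
    """Return raw image bytes from a URL or a local file path."""
    if is_url(source):
        # Imported lazily so that local paths work even without requests installed.
        import requests

        response = requests.get(source, timeout=timeout)
        response.raise_for_status()
        return response.content
    return Path(source).read_bytes()
```

Checking the scheme explicitly, rather than testing for any non-empty scheme, keeps strings like `C:/images/cat.png` on the local-path branch.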
Both helpers did the same local-then-Hub read; only the parsing differed. Unify as _load_qwen_vl_file that dispatches on the .json suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
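The unified helper could look roughly like the sketch below. `load_file_sketch` is a hypothetical stand-in for `_load_qwen_vl_file`; its exact signature and error handling in the PR are not shown here, so treat this only as an illustration of the local-then-Hub read with a suffix-based parse.

```python
import json
from pathlib import Path


def load_file_sketch(model_path: str, filename: str):
    """Read `filename` from a local model directory, else from the Hub.

    Returns a parsed object for .json files and raw text otherwise.
    """
    local = Path(model_path) / filename
    if local.is_file():
        text = local.read_text(encoding="utf-8")
    else:
        # Imported lazily: only needed when the file is not available locally.
        from huggingface_hub import hf_hub_download

        cached = hf_hub_download(repo_id=model_path, filename=filename)
        text = Path(cached).read_text(encoding="utf-8")
    return json.loads(text) if filename.endswith(".json") else text
```

The same call then serves both `processor_config.json` (returned as a dict) and `chat_template.jinja` (returned as a string).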
The folder name already scopes these, so the prefix is noise:
- _load_qwen_vl_file -> _load_file
- _qwen_vl_image_kwargs -> _image_kwargs
- _qwen_vl_video_kwargs -> _video_kwargs
Classes keep their qualified Qwen3VL* names since they're the module's public surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the 6-query × 6-image retrieval benchmark, the mlx-embeddings output had max|cosine diff| = 0.087 vs the HF transformers reference and only 83% top-1 agreement. Three fixes close the gap to max 0.006 diff and 100% top-1/top-3 agreement:
1. Forward the embedder's MIN_PIXELS/MAX_PIXELS (4096..1,843,200) onto the inner image_processor. The Qwen3-VL preprocessor_config.json lists the full-context size bounds (16 MP), so without this override the image_processor resized to a different grid than the HF reference and the comparison ran on different visual tokens.
2. Work around an mlx-vlm bug in Qwen3-VL get_input_embeddings: the upstream assigns `mx.eval(deepstack_image_embeds)` to `deepstack_visual_embeds`, but mx.eval returns None, so multi-scale deepstack features were silently dropped at every LM layer the model was supposed to inject them into. Re-run the vision tower in our Model.get_input_embeddings when we detect this.
3. Patch mlx-vlm's `_deepstack_process` on the language-model instance: upstream indexes the full concatenated visual_embeds at each batch sample's image positions, which only works for batch_size=1. Our patched version slices visual_embeds per sample using a running offset so multi-image batches work. Once (2) is fixed upstream, (3) surfaces immediately: they're stacked bugs that cancel for single-image batches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
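The running-offset idea in item 3 can be sketched in plain numpy. This is not the patched `_deepstack_process` itself: the function name, the boolean-mask argument, and the array shapes are assumptions chosen for illustration.

```python
import numpy as np


def add_visual_embeds_per_sample(hidden_states, visual_mask, visual_embeds):
    """Add concatenated visual embeddings to each sample's image positions.

    hidden_states: (B, T, D) float array
    visual_mask:   (B, T) bool array, True at image-token positions
    visual_embeds: (N, D) array, rows concatenated over the batch in order
    """
    out = hidden_states.copy()
    offset = 0  # running offset into the concatenated visual_embeds
    for b in range(hidden_states.shape[0]):
        positions = np.flatnonzero(visual_mask[b])
        count = positions.size
        # Slice only this sample's rows instead of indexing the full tensor.
        out[b, positions] += visual_embeds[offset:offset + count]
        offset += count
    if offset != visual_embeds.shape[0]:
        raise ValueError("number of image positions does not match visual_embeds rows")
    return out
```

With a single sample the slice covers every row, which is consistent with the observation that the unpatched indexing only works for batch_size=1.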
…ugs" This reverts commit 45a501c.
Replaces the toy "embed 4 mixed inputs and print a 4x4 similarity" snippet with a real retrieval workflow: embed an image gallery once, score multiple text queries, and rank top-K per query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
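The scoring and ranking step of such a workflow can be sketched with numpy alone. The embeddings here are placeholders for whatever the model returns; `l2_normalize` and `rank_top_k` are hypothetical helper names, not code from the README example.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rank_top_k(query_emb: np.ndarray, gallery_emb: np.ndarray, k: int):
    """Return (indices, scores) of the k best gallery items for each query.

    query_emb:   (Q, D) text-query embeddings
    gallery_emb: (G, D) image-gallery embeddings, computed once and reused
    """
    scores = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T  # (Q, G)
    order = np.argsort(-scores, axis=1)[:, :k]  # descending by similarity
    return order, np.take_along_axis(scores, order, axis=1)
```

Because the gallery embeddings are computed once, adding a query costs one text embedding plus a single matrix product.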
88af1f0 to 967fbaf
Convert examples/qwen3_vl_retrieval.py into a notebook so the plot renders inline on GitHub (no separate PNG to keep in sync). README now links to the .ipynb. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
HF's Qwen3-VL image/video processors hard-require torch/torchvision. This PR inlines a numpy + PIL port of both processors into `mlx_embeddings/models/qwen3_vl/processor.py` so mlx-embeddings can run real Qwen3-VL checkpoints without torch installed.

Adapted from mlx-vlm's unreleased torch-free port (commit `1bf7742` on the local dev tree; not yet in any mlx-vlm PyPI release, so we inline rather than bump the dep).

Changes
- `Qwen3VLImageProcessor` + `Qwen3VLVideoProcessor` (subclasses of HF's `ImageProcessingMixin` / `BaseVideoProcessor`, duck-typed to match).
- `Qwen3VLProcessor` subclass of HF `ProcessorMixin` that overrides `check_argument_for_proper_class` to bypass HF's isinstance check against `transformers.utils.dummy_torchvision_objects`.
- `Processor.from_pretrained` now delegates to the local `Qwen3VLProcessor.from_pretrained`, which reads `processor_config.json` / `preprocessor_config.json` / `video_preprocessor_config.json` directly.
- Drops the `AutoImageProcessor.from_pretrained(use_fast=False)` fallback path, the `_UnsupportedVideoProcessor` stub, and the `object.__new__(Qwen3VLProcessor)` workaround. Video inputs are now supported, not stubbed.
- `test_qwen3_vl_processor_from_pretrained_uses_custom_loader` updated to mock at the new `Qwen3VLProcessor.from_pretrained` boundary.

Fixes on top of the mlx-vlm source
- Flatten list-of-list image/video batches: HF's `apply_chat_template` nests inputs that way and the upstream `__call__` crashed on them.
- Treat explicit `None` in `preprocessor_config.json` (`min_pixels`/`max_pixels`) the same as missing. The 2B Instruct checkpoint ships nulls alongside valid `size.shortest_edge`/`size.longest_edge`; the previous logic let the nulls clobber the valid sizes.

Test plan
- `pytest mlx_embeddings/tests/test_models.py`: 16/16 pass
- `mlx-community/Qwen3-VL-2B-Instruct-4bit`:
  - `model.embed({text, image=PIL})` returns `(1, 2048)` bf16
  - `model.rerank({query, documents=[{text, image}, ...]})` returns sensible scores
- `'torch' in sys.modules` stays `False` across load + embed + rerank (venv also has no torch installed, so this is a hard guarantee)
- `Qwen3VLImageProcessor` and `Qwen3VLVideoProcessor` are confirmed to be the local numpy classes (not transformers' or mlx-vlm's)

🤖 Generated with Claude Code
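The `sys.modules` check from the test plan can be expressed as a small reusable helper. This is a sketch only; `loaded_modules_matching` is a hypothetical name, and the model-loading calls that would precede the check are omitted.

```python
import sys


def loaded_modules_matching(prefixes):
    """Return sorted names of already-imported modules under any given top-level package."""
    return sorted(
        name
        for name in list(sys.modules)  # copy keys so iteration is safe
        if any(name == p or name.startswith(p + ".") for p in prefixes)
    )


# Intended use after load + embed + rerank (those calls are not shown here):
#     assert loaded_modules_matching(["torch", "torchvision"]) == []
```

Matching on the package prefix also catches submodules such as `torch.nn` that a plain `'torch' in sys.modules` test would only cover indirectly.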